Exercise 2 – Regression

```r
# Consider the following libraries for this exercise sheet:
library(ggplot2)
library(mlr3verse)
library(mlr3learners)
library(mlr3viz)
library(quantreg)
```
Hint: Useful libraries
Exercise 1: HRO in coding frameworks
Throughout the lecture, we will frequently use the R package mlr3 or the Python package sklearn, together with their descendants, which provide an integrated ecosystem for all common machine learning tasks. Let’s recap the HRO principle and see how it is reflected in either mlr3 or sklearn. An overview of the most important objects and their usage, illustrated with numerous examples, can be found in the mlr3 book and the scikit-learn documentation.
- How are the key concepts (i.e., hypothesis space, risk and optimization) you learned about in the lecture videos implemented?
Solution
- Have a look at mlr3::tsk("iris") / sklearn.datasets.load_iris. What attributes does this object store?
Solution
- Instantiate a regression tree learner (lrn("regr.rpart") / DecisionTreeRegressor). What are the different settings for this learner?
Hint
mlr3::mlr_learners$keys() shows all available learners.
Use get_params() to see all available settings.
Solution
Exercise 2: Loss functions for regression tasks
In this exercise, we will examine loss functions for regression tasks in more depth.
- Consider the above linear regression task. How will the model parameters be affected by adding the new outlier point (orange) if you use \(L1\) loss and \(L2\) loss, respectively, in the empirical risk? (You do not need to actually compute the parameter values.)
Solution
\(L2\) loss penalizes vertical distances to the regression line quadratically, while \(L1\) only considers the absolute distance. As the outlier point lies pretty far from the remaining training data, it will have a large loss with \(L2\), and the regression line will pivot to the bottom right to minimize the resulting empirical risk. A model trained with \(L1\) loss is less susceptible to the outlier and will adjust only slightly to the new data.
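This behavior can be sketched in base R on synthetic data (no mlr3 needed; the data, the outlier position, and the optimizer settings are illustrative choices, not part of the exercise):

```r
# Compare how a single outlier moves an L2 fit (ordinary least squares)
# vs. an L1 fit (least absolute deviations) on made-up data.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)

# L2 fit: closed form via lm()
fit_l2 <- function(x, y) coef(lm(y ~ x))

# L1 fit: minimize the sum of absolute residuals numerically,
# starting from the OLS coefficients
fit_l1 <- function(x, y) {
  obj <- function(th) sum(abs(y - th[1] - th[2] * x))
  optim(coef(lm(y ~ x)), obj, control = list(maxit = 2000))$par
}

# add a single outlier far below the trend at the right end
x_out <- c(x, 20)
y_out <- c(y, -20)

slope_l2_shift <- abs(fit_l2(x_out, y_out)[2] - fit_l2(x, y)[2])
slope_l1_shift <- abs(fit_l1(x_out, y_out)[2] - fit_l1(x, y)[2])

# the L2 slope reacts much more strongly to the outlier
c(L2 = slope_l2_shift, L1 = slope_l1_shift)
```

On data like this, the L2 slope pivots noticeably toward the outlier while the L1 slope barely moves, mirroring the argument above.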
- The second plot visualizes another loss function popular in regression tasks, the so-called Huber loss (depending on \(\epsilon > 0\); here: \(\epsilon = 5\)). Describe how the Huber loss deals with residuals as compared to \(L1\) and \(L2\) loss. Can you guess its definition?
Solution
The Huber loss combines the respective advantages of \(L1\) and \(L2\) loss: it is smooth and (once) differentiable like \(L2\) but does not punish larger residuals as severely, leading to more robustness. It is simply a (weighted) piecewise combination of both losses, where \(\epsilon\) marks the point where \(L2\) transitions to \(L1\) loss. The exact definition is:
\[ L\left(y, f(\mathbf{x})\right)= \begin{cases} \frac{1}{2}(y - f(\mathbf{x}))^2 & \text{ if } |y - f(\mathbf{x})|\le \epsilon \\ \epsilon |y - f(\mathbf{x})|-\frac{1}{2}\epsilon^2 \quad & \text{ otherwise } \end{cases}, \quad \epsilon > 0 \]
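This definition can be transcribed directly into plain R (with \(\epsilon = 5\) as in the plot):

```r
# Huber loss: quadratic for small residuals, linear beyond eps
huber <- function(r, eps = 5) {
  ifelse(abs(r) <= eps, 0.5 * r^2, eps * abs(r) - 0.5 * eps^2)
}

r <- seq(-10, 10, by = 0.1)

# inside [-eps, eps] it coincides with the (halved) L2 loss ...
all(huber(r)[abs(r) <= 5] == (0.5 * r^2)[abs(r) <= 5])

# ... and at the boundary value and slope match the linear part:
# huber(5) = 12.5, huber(6) = 17.5, so the slope beyond eps is eps = 5
huber(6) - huber(5)
```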
In the plot we can see how the parabolic shape of the loss around 0 evolves into an absolute-value function at \(|y - f(\mathbf{x})|> \epsilon = 5\).

Exercise 3: Polynomial regression
Assume the following (noisy) data-generating process from which we have observed 50 realizations: \[y = -3 + 5 \cdot \sin(0.4 \pi x) + \epsilon\] with \(\epsilon \, \sim \mathcal{N}(0, 1)\).
- We decide to model the data with a cubic polynomial (including intercept term). State the corresponding hypothesis space.
Solution
Cubic means degree 3, so our hypothesis space will look as follows:
\[ \mathcal{H}= \{ f(\mathbf{x}~|~ \boldsymbol{\theta})= \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 ~|~ (\theta_0, \theta_1, \theta_2, \theta_3)^\top \in \mathbb{R}^4 \} \]
- State the empirical risk w.r.t. \(\boldsymbol{\theta}\) for a member of the hypothesis space. Use \(L2\) loss and be as explicit as possible.
Solution
The empirical risk is:
\[ \mathcal{R}_{\text{emp}}(\boldsymbol{\theta})= \sum_{i = 1}^{50} \left(y^{(i)}- \left[ \theta_0 + \theta_1 x^{(i)} + \theta_2 \left( x^{(i)} \right)^2 + \theta_3 \left( x^{(i)} \right)^3 \right] \right)^2 \]
- We can minimize this risk using gradient descent. Derive the gradient of the empirical risk w.r.t \(\boldsymbol{\theta}\).
Solution
We can find the gradient just as we did for an intermediate result when we derived the least-squares estimator:
\[\begin{align*} \nabla_{\boldsymbol{\theta}} \mathcal{R}_{\text{emp}}(\boldsymbol{\theta})&= \frac{\partial{}}{\partial \boldsymbol{\theta}} \left \| \mathbf{y}- \mathbf{X}\boldsymbol{\theta}\right \|_2^2 \\ &= \frac{\partial{}}{\partial \boldsymbol{\theta}} \left( \left(\mathbf{y}- \mathbf{X}\boldsymbol{\theta}\right)^\top \left(\mathbf{y}- \mathbf{X}\boldsymbol{\theta}\right) \right) \\ &= - 2 \mathbf{y}^\top \mathbf{X}+ 2 \boldsymbol{\theta}^\top \mathbf{X}^\top \mathbf{X}\\ &= 2 \cdot \left(-\mathbf{y}^\top \mathbf{X}+ \boldsymbol{\theta}^\top \mathbf{X}^\top \mathbf{X}\right) \end{align*}\]
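As a sanity check, the analytic gradient can be compared against finite differences on simulated data (base R; drawing 50 values of \(x\) from \([-2, 2]\) is an assumption, since the sheet does not specify the input range):

```r
set.seed(1)
n <- 50
x <- runif(n, -2, 2)
X <- cbind(1, x, x^2, x^3)                      # cubic design matrix
y <- -3 + 5 * sin(0.4 * pi * x) + rnorm(n)      # DGP from the exercise

risk <- function(theta) sum((y - X %*% theta)^2)
# analytic gradient: the transpose of the row vector derived above
grad <- function(theta) 2 * (t(X) %*% X %*% theta - t(X) %*% y)

theta <- c(1, -1, 0.5, 0.2)                     # arbitrary test point
num_grad <- sapply(1:4, function(j) {
  h <- 1e-6
  e <- replace(numeric(4), j, h)
  (risk(theta + e) - risk(theta - e)) / (2 * h) # central difference
})
max(abs(grad(theta) - num_grad))                # tiny, up to rounding error
```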
- Using the result for the gradient, explain how to update the current parameter \(\boldsymbol{\theta}^{[t]}\) in a step of gradient descent.
Solution
Recall that the idea of gradient descent (!) is to traverse the risk surface in the direction of the negative gradient as we search for the minimum. Therefore, we will update our current parameter set \(\boldsymbol{\theta}^{[t]}\) with the negative gradient of the current empirical risk w.r.t. \(\boldsymbol{\theta}\), scaled by learning rate (or step size) \(\alpha\):
\[ \boldsymbol{\theta}^{[t +1]}= \boldsymbol{\theta}^{[t]}- \alpha \cdot \nabla_{\boldsymbol{\theta}} \mathcal{R}_{\text{emp}}(\boldsymbol{\theta}^{[t]}). \]
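This update rule can be run as-is; below is a minimal base-R loop on simulated data from the exercise's DGP (the input range, learning rate, and iteration count are hand-picked illustrative values):

```r
set.seed(1)
n <- 50
x <- runif(n, -2, 2)                        # assumed input range
X <- cbind(1, x, x^2, x^3)
y <- -3 + 5 * sin(0.4 * pi * x) + rnorm(n)

# precompute the fixed parts of the gradient from c)
XtX <- t(X) %*% X
Xty <- t(X) %*% y
grad <- function(theta) as.vector(2 * (XtX %*% theta - Xty))

theta <- rep(0, 4)                          # theta^[0]
alpha <- 1e-4                               # learning rate (step size)
for (step in seq_len(50000)) {
  theta <- theta - alpha * grad(theta)      # the update rule above
}

# gradient descent approaches the least-squares solution
theta_ols <- as.vector(solve(XtX, Xty))
max(abs(theta - theta_ols))                 # close to 0
```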
What actually happens here: we update each component of our current parameter vector \(\boldsymbol{\theta}^{[t]}\) in the direction of the negative gradient, i.e., following the steepest downward slope, and by an amount that depends on the magnitude of the gradient.
In order to see what that means, it is helpful to recall that the gradient \(\nabla_{\boldsymbol{\theta}} \mathcal{R}_{\text{emp}}(\boldsymbol{\theta})\) tells us about the effect that (infinitesimally small) changes in \(\boldsymbol{\theta}\) have on \(\mathcal{R}_{\text{emp}}(\boldsymbol{\theta})\). Therefore, gradient updates focus on influential components, and we proceed more quickly along the important dimensions.

- You will not be able to fit the data perfectly with a cubic polynomial. Describe the advantages and disadvantages that a more flexible model class would have. Would you opt for a more flexible learner?
Solution
We see that, for example, the first model in exercise b) fits the data fairly well but not perfectly. Choosing a more flexible function (a polynomial of higher degree, or a function from an entirely different, more complex model class) might be advantageous:
We would be able to trace the observations more closely if our function were less smooth, and thus reduce empirical risk. On the other hand, flexibility also has drawbacks:
Flexible model classes often have more parameters, making training harder.
We might run into a phenomenon called overfitting. Recall that our ultimate goal is to make predictions on new observations. However, fitting every quirk of the training observations – possibly caused by imprecise measurement or other factors of randomness/error – will not generalize so well to new data.
In the end, we need to balance model fit and generalization. We will discuss the choice of hypotheses quite a lot since it is one of the most crucial design decisions in machine learning.
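This trade-off can be made concrete with a quick simulation from the given DGP (base R; the input range, sample sizes, and degree grid are arbitrary illustrative choices): training error keeps falling as the polynomial degree grows, while error on fresh data does not.

```r
set.seed(1)
dgp <- function(n) {
  x <- runif(n, -2, 2)                      # assumed input range
  data.frame(x = x, y = -3 + 5 * sin(0.4 * pi * x) + rnorm(n))
}
train <- dgp(50)
test <- dgp(10000)

# fit polynomials of increasing degree, record train and test MSE
errs <- sapply(1:15, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - fitted(fit))^2),
    test = mean((test$y - predict(fit, newdata = test))^2))
})

# training MSE never increases with degree; the very flexible fits
# have a test MSE well above their training MSE (overfitting)
errs[, c(1, 3, 15)]
```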
Exercise 4: Predicting abalone
We want to predict the age of an abalone using its longest shell measurement and its weight. The abalone data can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data.
Prepare the data as follows:
- Plot LongestShell and WholeWeight on the \(x\)- and \(y\)-axis, respectively, and color points according to Rings.
Solution
- Using mlr3/sklearn, fit a linear regression model to the data.
Solution
- Compare the fitted and observed targets visually.
R Hint
Use autoplot() from mlr3viz on the prediction object.
Solution
We see a scatterplot of predicted vs. true values, where the small bars along the axes (a so-called rug plot) indicate how many observations fall into each region. As we might have suspected from the first plot, the underlying relationship is not exactly linear (ideally, all points, and the resulting line, would lie on the diagonal). With a linear model we tend to underestimate the response.
- Assess the model’s training loss in terms of MAE.
Hint
In R, call $score(), which accepts different mlr_measures, on the prediction object.
In Python, use from sklearn.metrics import mean_absolute_error.
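For reference, the MAE itself is just the mean absolute residual. A base-R sketch on synthetic stand-in data (the column names mirror the abalone data, but the numbers are made up, and downloading the real file is skipped here):

```r
set.seed(1)
# synthetic stand-in for the abalone columns (not the real data!)
df <- data.frame(LongestShell = runif(100, 0.1, 0.8))
df$WholeWeight <- 4 * df$LongestShell^2 + rnorm(100, sd = 0.1)
df$Rings <- round(5 + 15 * df$LongestShell + rnorm(100))

fit <- lm(Rings ~ LongestShell + WholeWeight, data = df)
mae <- mean(abs(df$Rings - fitted(fit)))    # training MAE
mae
```

In mlr3 the equivalent is prediction$score(msr("regr.mae")); in sklearn, mean_absolute_error(y_true, y_pred).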